feat(evaluators): add built-in budget evaluator for per-agent cost tracking#144
feat(evaluators): add built-in budget evaluator for per-agent cost tracking#144amabito wants to merge 1 commit intoagentcontrol:mainfrom
Conversation
lan17
left a comment
There was a problem hiding this comment.
Thank you for working on this! It looks overall in the right direction to me, but I think we can improve the design a bit to make it more universal.
Could you also implement this as a contrib evaluator so it doesn't become a built-in evaluator right away? We will then put it to use and once its in a production ready stage we can move to builtin.
| scope: dict[str, str] = Field(default_factory=dict) | ||
| per: str | None = None | ||
| window: Literal["daily", "weekly", "monthly"] | None = None | ||
| limit_usd: float | None = None |
There was a problem hiding this comment.
It seems like this is a wrong abstraction for budgeting. What if the budget is in different currency? Why do we need separate fields for tokens vs usd?
It might be better to do something like an integer for a limit and then define "currency" Enum which could be USD, tokens, Euros, etc.,
I don't think there's a use case for having floating point for USD or Euros, no?
There was a problem hiding this comment.
Good point. I'll switch to an integer limit + a Currency enum (USD, EUR, tokens, etc.). Float was unnecessary — cents-level precision is handled by using integer cents anyway.
|
|
||
| scope: dict[str, str] = Field(default_factory=dict) | ||
| per: str | None = None | ||
| window: Literal["daily", "weekly", "monthly"] | None = None |
There was a problem hiding this comment.
What if we want hourly or half-an-hour?
Maybe its better to define "window" as an integer in seconds, or minutes? That way you can express whatever window you want.
There was a problem hiding this comment.
Agreed. Will change to window_seconds: int. I'll keep a few named constants (DAILY = 86400 etc.) as convenience but the field itself will be raw seconds.
| on the first breach. | ||
|
|
||
| Attributes: | ||
| scope: Static scope dimensions that must match for this rule |
There was a problem hiding this comment.
Worth it to give some examples here for scope?
There was a problem hiding this comment.
Sure, will add. Something like {"agent": "summarizer", "channel": "slack"}.
| limits: list[BudgetLimitRule] = Field(min_length=1) | ||
| pricing: dict[str, dict[str, float]] | None = None | ||
| token_path: str | None = None | ||
| cost_path: str | None = None |
There was a problem hiding this comment.
Shouldn't this evaluator be computing cost in USD based on model and token count?
Doesn't seem like an LLM step should be passing it down here. I maybe wrong on this.
There was a problem hiding this comment.
My thinking was: the caller already has cost from the LLM response, so passing it avoids maintaining a pricing table. But I see the argument — if the evaluator owns cost computation, the contract is simpler and the caller can't lie about cost. One question: should the evaluator maintain its own pricing table, or pull from an external source (e.g. LiteLLM's model cost map)?
There was a problem hiding this comment.
I believe you already include a pricing table in the evaluator config, so for version 1 of this evaluator we can just rely on that?
There was a problem hiding this comment.
Makes sense. I'll have the evaluator compute cost from the config pricing table + model + token counts. That lets me drop cost_path entirely — just need token_path and a new model_path to extract the model name from the input. If the model isn't in the pricing table, fail closed.
| return max(ratios) if ratios else 0.0 | ||
|
|
||
|
|
||
| class InMemoryBudgetStore: |
There was a problem hiding this comment.
nit: This should go into a separate file so that interface for store and its implementation are separate.
There was a problem hiding this comment.
Will split into store.py (protocol) and memory_store.py (InMemoryBudgetStore).
| """Atomically record usage and return current budget state. | ||
|
|
||
| Args: | ||
| scope_key: Composite scope identifier (e.g. "channel=slack|user_id=u1"). |
There was a problem hiding this comment.
this doc should go into interface docs instead of here
There was a problem hiding this comment.
Yep, will move to the protocol definition.
| def record_and_check( | ||
| self, | ||
| scope_key: str, | ||
| period_key: str, |
There was a problem hiding this comment.
Why is this needed to be passed in? Shouldn't implementation be figuring this out on its own based on current time?
There was a problem hiding this comment.
Fair. The store should derive the period from window_seconds + current time internally. I was passing it in for testability but I can inject a clock instead.
| input_tokens: int, | ||
| output_tokens: int, | ||
| cost_usd: float, | ||
| limit_usd: float | None = None, |
There was a problem hiding this comment.
Why do we need to pass in limit here? Can't we instantiate the store with the BudgetEvaluatorConfig or something similar so it knows what limits are already?
If we want to share the store between many different kinds of limits and keys, can't we just pass in BudgetLimitRule here instead?
There was a problem hiding this comment.
Makes sense. I'll have the store accept BudgetLimitRule at registration time so it knows its own limits. record_and_check just takes usage data.
| # --------------------------------------------------------------------------- | ||
|
|
||
|
|
||
| class TestInMemoryBudgetStore: |
There was a problem hiding this comment.
Tests should follow Given/When/Then behavioral comment style.
There was a problem hiding this comment.
Will restructure. Quick example to confirm this is what you mean:
def test_single_record_under_limit(self):
# Given: store with a $10 daily limit
# When: record $3 of usage
# Then: not breached, ratio ~0.3| Attributes: | ||
| scope: Static scope dimensions that must match for this rule | ||
| to apply. Empty dict = global rule. | ||
| per: If set, the limit is applied independently for each unique |
There was a problem hiding this comment.
I don't quite understand this. Can't we just handle separate budgets by having multiple rules with different scope dicts?
There was a problem hiding this comment.
For static scopes, agreed — multiple rules work. My concern is the dynamic case: "each user gets $10/day" where users aren't known at config time. With per, one rule covers all future users. Without it, you'd need to generate rules on the fly. Would a group_by field work? e.g. group_by: "user_id" means "apply this limit independently per distinct user_id value." Open to other approaches if you have something in mind.
|
@lan17 Thanks for the review. Moving to contrib — no objection, will restructure. Responded to each thread inline. Aiming for R2 within a week. |
any updates @amabito ? Would it be ok with you if we take over this PR to get it over the finish line? This feature is going to be super useful to the whole project! |
|
@lan17 Hey — update is mostly ready, was waiting on your reply on the cost computation thread before pushing. Got that answer now so I'll push the revision shortly. |
…cking New contrib evaluator "budget" that tracks cumulative token/cost usage per agent, channel, user. Configurable time windows via window_seconds. Design per reviewer feedback: - Contrib evaluator (not builtin) for production hardening - Integer limit + Currency enum (USD/EUR/tokens) - window_seconds (int) instead of named windows - group_by for dynamic per-user/per-channel budgets - Evaluator owns cost computation from pricing table - BudgetStore protocol + InMemoryBudgetStore (dict + Lock) - Store derives period keys internally, injectable clock Addresses agentcontrol#130. 55 tests (incl. thread safety, NaN/Inf, scope injection, double-count).
fa414d7 to
a1345b9
Compare
Summary
per agent, channel, user. Configurable time windows (daily/weekly/monthly).
Scope
User-facing/API changes:
Internal changes:
Out of scope:
Risk and Rollout
Testing
Checklist